In [1]:
using DataFrames
using Distributions
using Gadfly
using Optim
In [2]:
T = [0, 1, 1, 0, 2, 3, 0, 1, 1, 1, 1, 2, 0, 1, 3, 0, 1, 2, 1, 3, 3, 4, 1, 3, 2, 0];
C = [2, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1, 0, 2];
Under the null hypothesis, $H_0$, $\theta_{\text{T}} = \theta_{\text{C}} = \theta$, while under the alternative hypothesis, $H_1$, $\theta_{\text{T}} \neq \theta_{\text{C}}$.
If $H_0$ is true, the log-likelihood function for the counts is
$$\ell_0 = \ell(\theta; \mathbf{y}) = \sum_i (\text{T}_i \log \theta - \theta - \log \text{T}_i !) + \sum_i (\text{C}_i \log \theta - \theta - \log \text{C}_i !).$$
The maximum likelihood estimate is
$$\hat{\theta}_0 = \frac{\sum_i \text{T}_i + \sum_i \text{C}_i}{\lvert \text{T}\rvert + \lvert \text{C}\rvert}.$$
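This estimate follows from differentiating $\ell_0$ with respect to $\theta$ and setting the derivative to zero:
$$\frac{\partial \ell_0}{\partial \theta} = \frac{\sum_i \text{T}_i + \sum_i \text{C}_i}{\theta} - (\lvert \text{T}\rvert + \lvert \text{C}\rvert) = 0 \quad\Longrightarrow\quad \hat{\theta}_0 = \frac{\sum_i \text{T}_i + \sum_i \text{C}_i}{\lvert \text{T}\rvert + \lvert \text{C}\rvert},$$
i.e. the pooled sample mean of the two groups.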
In [3]:
θ̂₀ = (sum(T) + sum(C)) / (size(T, 1) + size(C, 1))
Out[3]:
In [4]:
ℓ(θ, Y) = sum(y -> y * log(θ) - θ - log(factorial(y)), Y);
In [5]:
ℓ₀(θ) = ℓ(θ, T) + ℓ(θ, C);
In [6]:
ℓ̂₀ = ℓ₀(θ̂₀)
Out[6]:
If $H_1$ is true, the log-likelihood for the counts is
$$\ell_1 = \ell(\theta_\text{T}, \theta_\text{C}; \mathbf{y}) = \sum_i (\text{T}_i \log \theta_\text{T} - \theta_\text{T} - \log \text{T}_i !) + \sum_i (\text{C}_i \log \theta_\text{C} - \theta_\text{C} - \log \text{C}_i !).$$
The maximum likelihood estimates are
$$\hat{\theta}_{1\text{T}} = \frac{\sum_i \text{T}_i}{\lvert \text{T}\rvert}$$
and
$$\hat{\theta}_{1\text{C}} = \frac{\sum_i \text{C}_i}{\lvert \text{C}\rvert}.$$
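Each estimate is simply that group's sample mean: the two parameters separate in $\ell_1$, so for the treatment group setting $\partial \ell_1 / \partial \theta_\text{T} = 0$ gives
$$\frac{\sum_i \text{T}_i}{\theta_\text{T}} - \lvert \text{T}\rvert = 0 \quad\Longrightarrow\quad \hat{\theta}_{1\text{T}} = \frac{\sum_i \text{T}_i}{\lvert \text{T}\rvert},$$
and symmetrically for $\hat{\theta}_{1\text{C}}$.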
In [7]:
θ̂₁t = sum(T) / size(T, 1)
Out[7]:
In [8]:
θ̂₁c = sum(C) / size(C, 1)
Out[8]:
In [9]:
ℓ₁(θt, θc) = ℓ(θt, T) + ℓ(θc, C);
In [10]:
ℓ̂₁ = ℓ₁(θ̂₁t, θ̂₁c)
Out[10]:
The maximum value of the log-likelihood function $\ell_1$ will always be greater than or equal to that of $\ell_0$, since the model under $H_1$ has one extra free parameter. To decide whether the difference is statistically significant, we need to know the sampling distribution of the likelihood-ratio statistic.
[Discussed later.]
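As a preview of that discussion, the standard large-sample result is that twice the log-likelihood difference, $2(\hat{\ell}_1 - \hat{\ell}_0)$, is approximately $\chi^2_1$-distributed under $H_0$. A minimal self-contained sketch (the $\sum_i \log y_i!$ terms cancel in the difference, so they are dropped here):

```julia
# Preview of the likelihood-ratio test; the χ²₁ approximation is justified later.
using Distributions

T = [0, 1, 1, 0, 2, 3, 0, 1, 1, 1, 1, 2, 0, 1, 3, 0, 1, 2, 1, 3, 3, 4, 1, 3, 2, 0]
C = [2, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1, 0, 2]

# Poisson log-likelihood, dropping the Σᵢ log yᵢ! constant (it cancels in the ratio).
loglik(θ, Y) = sum(Y) * log(θ) - length(Y) * θ

θ̂₀ = (sum(T) + sum(C)) / (length(T) + length(C))  # pooled MLE under H0
θ̂t = sum(T) / length(T)                            # group MLEs under H1
θ̂c = sum(C) / length(C)

# Deviance: twice the gap between the maximized log-likelihoods.
lr = 2 * (loglik(θ̂t, T) + loglik(θ̂c, C) - (loglik(θ̂₀, T) + loglik(θ̂₀, C)))

# Under H0 the statistic is asymptotically χ² with one degree of freedom.
p = ccdf(Chisq(1), lr)
```

With these data the statistic comes out around 2.7 with $p \approx 0.10$, so $H_0$ would not be rejected at the 5% level.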
In [11]:
xs = linspace(0, 2.5)[2:end]  # drop θ = 0, where ℓ₀ is -Inf
ys = map(ℓ₀, xs)
plot(x = xs, y = ys, Geom.line)
Out[11]:
In [12]:
θ̃₀ = Optim.minimizer(optimize(θ -> -ℓ₀(θ), 0, 2))
Out[12]:
In [13]:
isapprox(θ̃₀, θ̂₀; atol = 1e-6)  # numerical optimum matches the analytic MLE
Out[13]:
In [14]:
d₀ = Poisson(θ̂₀)
d₁t = Poisson(θ̂₁t)
d₁c = Poisson(θ̂₁c);
In [15]:
x = 10
xs = 0:x
Ht = hist(T, -1:x)[2]
Hc = hist(C, -1:x)[2]
Htc = DataFrame(
count = xs,
T = Ht, C = Hc,
T̂₀ = pdf(d₀, xs) .* length(T), C_hat_0 = pdf(d₀, xs) .* length(C),
T̂₁ = pdf(d₁t, xs) .* length(T), Ĉ₁ = pdf(d₁c, xs) .* length(C)
)
Out[15]:
In [16]:
Htcₗ = stack(Htc, collect(2:7))
Out[16]:
In [17]:
plot(
Htcₗ[(Htcₗ[:variable] .== :C) | (Htcₗ[:variable] .== :C_hat_0), :],
x = :count, y = :value, color = :variable,
Geom.bar(position = :dodge)
)
Out[17]:
In [18]:
Htcₗ[(Htcₗ[:variable] .== :C) | (Htcₗ[:variable] .== :C_hat_0), :]
Out[18]: